Abstract: The articulated and complex nature of human actions makes
the task of action recognition difficult. One approach to handle this
complexity is dividing it into the kinetics of body parts and analyzing
the actions based on these partial descriptors. We propose a joint
sparse regression based learning method which utilizes structured
sparsity to model each action as a combination of multimodal features
from a sparse set of body parts. To represent the dynamics and appearance
of parts, we employ a heterogeneous set of depth and skeleton based
features. The proper structure of the multimodal multipart features is
formulated into the learning framework via the proposed hierarchical
mixed norm, which regularizes the structured features of each part and
applies sparsity between them, in favor of a group feature selection.
Our experimental results demonstrate the effectiveness of the proposed
learning method: it outperforms other methods on all three tested
datasets and saturates one of them by achieving perfect accuracy.

Index Terms: Action recognition, Kinect, Joint sparse regression,
Mixed norms, Structured sparsity, Group feature selection
1 Introduction

Human actions consist of the simultaneous movement of different body
parts. Because of this complex articulated nature of human
movements, the analysis of these signals can be highly
complicated. To ease the task of classification, actions can be
broken down into their components. This is done by body
part detection on depth sequences of human body movements
[1]. Having the 3D locations of body joints in the scene, we
can separate the complicated motion of the body into a concurrent
set of behaviors on major skeleton joints; therefore human
action sequences can be considered as multipart signals.
Throughout this paper, we use the term "part" to denote each
body joint as defined in [1].

Limiting the learning to skeleton based features cannot
deliver high levels of performance in action recognition,
because: (1) most common human actions are defined by
the interaction of the body with other objects, and (2) depth
based skeleton data is not always accurate due to noise
and occlusion of body parts. To alleviate these issues, different
depth based appearance features can be leveraged. The work
in [2] proposed LOP (local occupancy patterns) around each
of the body joints in order to represent the 3D appearance of the
interacting objects. Another solution is HON4D (histogram of
oriented 4D normals) [3], which gives more descriptive and
robust models of the local depth based appearance and motion
around the joints. Since these features have complementary
properties, it is beneficial to utilize all of them as
different descriptors for each joint. Combining heterogeneous
features of each part of the skeleton leads to a
multimodal-multipart combination, which demands sophisticated
fusion algorithms.

An interesting approach to handle the articulation of actions
was recently proposed by [2]. As the key intuition, they
showed that each individual action class can be represented by
the behavior and appearance of a few informative joints in the
body. They utilized a data mining technique to find these
discriminative sets of joints for each class of the available
actions and tied up the features of those parts as "actionlets".
They employed a multi-kernel learning method to build up
ensembles of actionlets as kernels for action classification. This
method is highly robust against the noise in depth maps,
and the results show its strength in characterizing human
body motion and also human-object interactions. However, the
downside of this approach is the inconsistency of its heuristic
selection process (mining actionlets) with the following learning
step. Moreover, it simply concatenates different types of
features for multimodal fusion, which is another drawback of
this work. In this fashion, achieving the optimal combination of
features for the classification task cannot be guaranteed.

To overcome the limitations mentioned above, we propose
a joint structured sparsity regression based learning method
which integrates part selection into the learning process while
considering the heterogeneity of features for each joint. We associate
all the features of each part as a bundle and apply a group
sparsity regularization to select a small number of active parts
for each action class. To model the precise hierarchy of the
multimodal-multipart features in an integrated learning and
selection framework, we propose a hierarchical mixed norm
which includes three levels of regularization over the learning
weights. To apply the modality based coupling over the
heterogeneous features of each part, it applies a mixed norm with
two degrees of "diversity" induction [4], followed by a group
sparsity among the feature groups of different parts to apply
part selection.

The main contributions of this paper are two-fold: First, we
integrate the part selection process into our learning in order
to select discriminative body parts for different action classes
latently, and utilize them to learn classifiers. Second, a
hierarchical mixed norm is proposed to apply the desired
simultaneous sparsity and regularization over different levels of learning
weights corresponding to our special multimodal-multipart
features in a joint group sparsity regression framework.

We evaluate our method on three challenging depth based
action recognition datasets: MSR-DailyActivity dataset [2],
MSR-Action3D dataset [5], and 3D-ActionPairs dataset [3].
Our experimental results show that the proposed method is
superior to other available methods for action recognition on
depth sequences.

The rest of this paper is organized as follows: Section 2
reviews the related works on depth based action recognition,
joint sparse regression, mixed norms, and multitask learning.
Section 3 presents the proposed integrated feature selection and
learning scheme. It also introduces the new multimodal-multipart
mixed norm which applies regularization and group
sparsity in the proposed learning model. Experimental results on
the three above-mentioned benchmarks are covered in Section 4,
and we conclude the paper in Section 5.
2 Related Work

Visual features extracted from depth signals can be classified
into two major classes. The first are skeleton based features,
which extract information from the provided 3D locations of
body joints on each frame of the sequence. Essentially,
skeletons have a very succinct and highly discriminative
representation of the actions. [6] utilized them to extract "eigenjoints"
for action classification using a naive-Bayes-nearest-neighbor
classifier. In [7], spherical histograms of the 3D locations of the
joints went through an HMM to model the temporal changes and
perform the final action classification. The presence of noise in depth
maps and the occlusion of body parts bound the reliability of this type of
features. Another major deficiency of skeleton data is its
incapacity to represent the interactions of the body with other
objects, which is crucial for activity interpretation.

The other group consists of features which are extracted
directly from depth maps. Most of the features in this class
consider depth maps as spatio-temporal signals and try to
extract local or holistic descriptions from input sequences. [5]
proposed a depth based action graph model in which each
node indicates a salient posture and actions are represented
as paths through graph nodes. To deal with occlusion and
noise issues in depth maps, [8] proposed "random occupancy
pattern" features and applied an elastic-net regularization
[9] to find the most discriminative subset of features for
action recognition. STIP (space-time interest point) detection
described by HOG (histogram of oriented gradients) [10] and
HOF (histogram of optical flow) was originally proposed for
recognition purposes on RGB videos [11], but [12] showed this
could be easily generalized to RGB+D signals. To improve the
discrimination of descriptors, they generalized the idea of
"motion history images" [13] over depth maps. Noise suppression
could also boost the performance of STIP detection on depth
sequences [14]. Four dimensional surface normals were shown
to be very powerful representations of body movements over
depth signals [3]. This idea was a generalization of HOG3D
[15] to four dimensional depth videos. They quantized the
4D normal vectors of depth surfaces by taking their histograms
over the vertices of a 4D regular polychoron, which were
shown to be highly informative for action classification.

Considering the strengths and weaknesses of the aforementioned
classes of features, we infer that they are complementary to each
other, and that to achieve higher levels of performance we have to
combine them. [2] used histograms of 3D point clouds around
the joints (LOP) in addition to skeleton based features for
action classification using an "actionlet ensemble" framework.
[16] added local HON4D [3] to joint features to learn a
max-margin temporal warping based action classifier. We utilize
skeletons, LOP and HON4D as state-of-the-art depth based
features to build up our multimodal input for the task of action
recognition.

The main intuition behind the work of [2] was the fact
that the features of a few informative joints are good enough for
recognizing each class of actions. They defined an "actionlet"
as the combination of features of a limited number of joints,
and based on the discriminative power of each joint and each
actionlet, they performed a data mining procedure to find the
best actionlets for each action class. They used the mined
actionlets as kernels in a multi-kernel multiclass SVM. We
further extend this idea by applying group sparsity in a joint
feature selection framework. To do so, we group the features
of each part (joint) and apply an $L^1$ norm between these groups
to achieve a sparse set of active parts to represent each action
class.

Mixed norms are powerful tools to inject simultaneous
sparsity and coupling effects between the learning coefficients.
They have been studied in a variety of fields. In the statistical
domain, [17] proposed the "group Lasso" as an extension of
"Lasso" [18] for grouped variable selection in regression.
[19] introduced the "composite absolute penalty" for hierarchical
variable selection. "Hierarchical penalization" was also proposed
to utilize the prior structure of the variables for a better fitting
model [20]. In sparse regression, mixed norms have been used
as regularization terms to link sparsity and persistence of
variables [21]. A generalized shrinkage scheme was proposed
by [22] for structured sparse regression. [23] used mixed norms
as structured sparsity regularizers for heterogeneous feature
fusion, and [24] extended this idea to multi-view clustering.
[25] proposed a robust self-taught learning using mixed norms,
and [26] utilized a fractional mixed norm for robust adaptive
dictionary learning. In this paper, to regularize the multimodal
features of each part, we apply a mixed $L^2/L^4$ norm. To
achieve sparsity between parts, we generalize this into an
$L^1/L^2/L^4$ hierarchical norm.

If multiple learning tasks at hand share some inherent
constituents or structures, "Multitask Learning" [27] techniques
could be globally beneficial. In joint sparse regression,
multitask learning is formulated by a mixed norm. [28] proposed
an $L^1/L^\infty$ norm to add this into Lasso for variable selection.
In joint feature selection, the $L^1/L^2$ norm can provide multitask
learning by applying selection between the $L^2$ regularized
parameters of each feature [29]. The same norm is used in [30] as a
generalization of the $L^1$ norm in a multitask joint sparsity
representation model to fuse complementary visual features across
recognition tasks. [31] studied different mixed norms when
they applied multitask sparse learning in visual tracking, and
their experimental results showed that $L^1/L^2$ is
superior among them. In this work, we use a similar norm
to utilize the shared latent factors between different binary
action classifiers. We apply $L^2$ regularization over the weights
corresponding to each feature across all the tasks, followed by
an $L^1$ norm between all the features at hand.


3 Multimodal Multipart Learning 

Notations 

Throughout this paper, we use bold uppercase letters to
represent matrices and bold lowercase letters to indicate vectors. For
a matrix $\mathbf{X}$, we denote its j-th row as $\mathbf{x}^j$ and its i-th column
as $\mathbf{x}_i$.

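Assume the partition $\xi$ is defined over a vector $\mathbf{z}$ to divide
its elements into $|\xi|$ disjoint sets. We use $\xi_i$ to represent the
indices of the i-th set in $\xi$, and its corresponding elements in $\mathbf{z}$ are
referred to as $\mathbf{z}^{\xi_i}$; also, $z^{\xi_i,k}$ represents the k-th element of $\mathbf{z}^{\xi_i}$.
The $L^p/L^q$ norm of $\mathbf{z}$ regarding $\xi$ is represented by $\|\mathbf{z}\|_{q,p|\xi}$
and is defined as the $L^q$ norms of the elements inside each set
of $\xi$ followed by an $L^p$ norm of the $L^q$ values across the sets;
mathematically:

$$ \|\mathbf{z}\|_{q,p|\xi} = \left( \sum_{i=1}^{|\xi|} \left( \sum_{k=1}^{|\xi_i|} \left| z^{\xi_i,k} \right|^q \right)^{p/q} \right)^{1/p} \tag{1} $$

in which $|\xi_i|$ indicates the cardinality of set $\xi_i$.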
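
As a concrete reading of definition (1), the following minimal Python sketch evaluates a two-level mixed norm for a partition given as lists of indices. The function names `lp_norm` and `mixed_norm` and the toy vector are illustrative assumptions, not part of the original method.

```python
def lp_norm(values, p):
    """L^p norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def mixed_norm(z, partition, q, p):
    """L^p/L^q norm of z: L^q inside each index set, then L^p across sets."""
    inner = [lp_norm([z[k] for k in index_set], q) for index_set in partition]
    return lp_norm(inner, p)


# Toy vector split into three disjoint index sets (hypothetical data).
z = [3.0, 4.0, 0.0, 0.0, 5.0, 12.0]
partition = [[0, 1], [2, 3], [4, 5]]

# L^1/L^2 norm: the per-set L^2 values are 5, 0 and 13, which sum to 18.
print(mixed_norm(z, partition, q=2, p=1))
```

With q = 2 and p = 1 this is the group sparsity penalty used later for part selection: a set whose elements are all zero contributes nothing to the sum.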
Now consider that the elements of each set $\xi_i$ are further
partitioned by operator $\rho$ into $|\rho|$ disjoint subsets. Similarly, we
indicate the j-th $\rho$-subset of the i-th $\xi$-set of $\mathbf{z}$ as $\mathbf{z}^{\xi_i,\rho_j}$, and $z^{\xi_i,\rho_j,k}$
represents its k-th element. The $L^p/L^q/L^r$ norm of $\mathbf{z}$ regarding
$\xi$ and $\rho$ is represented by $\|\mathbf{z}\|_{r,q,p|\rho,\xi}$ and is defined as the
$L^q/L^r$ norms (regarding $\rho$) of all $|\xi|$ sets followed by an $L^p$
norm of the $L^q/L^r$ values across the sets of $\xi$; mathematically:

$$ \|\mathbf{z}\|_{r,q,p|\rho,\xi} = \left( \sum_{i=1}^{|\xi|} \left( \sum_{j=1}^{|\rho|} \left( \sum_{k} \left| z^{\xi_i,\rho_j,k} \right|^r \right)^{q/r} \right)^{p/q} \right)^{1/p} \tag{2} $$


This representation can be easily extended into higher orders 
of structural mixed norms by further partitioning the subsets. 
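
The three-level norm of (2) can be sketched in the same way, with the two-level partition written as nested index lists. This is only an illustration under assumed names (`hierarchical_norm`, `nested_partition`) and toy values.

```python
def lp_norm(values, p):
    """L^p norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def hierarchical_norm(z, nested_partition, r, q, p):
    """L^p/L^q/L^r norm of z for a two-level partition.

    nested_partition[i][j] lists the indices of the j-th subset
    of the i-th outer set.
    """
    set_values = []
    for subsets in nested_partition:
        # L^r inside each subset, then L^q across the subsets of this set.
        subset_values = [lp_norm([z[k] for k in idx], r) for idx in subsets]
        set_values.append(lp_norm(subset_values, q))
    # L^p across the outer sets.
    return lp_norm(set_values, p)


# Two outer sets, each with two subsets of two elements (hypothetical data).
z = [1.0, 1.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0]
nested_partition = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
print(hierarchical_norm(z, nested_partition, r=4, q=2, p=1))
```

Deeper hierarchies follow the same pattern by nesting one more level of index lists and one more call to `lp_norm`.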


3.1 Multipart Learning by Structured Sparsity 

Our purpose of learning is to recognize the actions in depth
videos, based on the extracted depth based and skeleton based
features. The set of input features we use to describe each
action sample is a combination of multimodal multipart
features. The entire body is separated into a number of parts
(as illustrated in Fig. 1), and for each part we have different
types of features to represent the movement and local depth
appearance. Therefore, our input feature set for each input
sample can be represented by a vector $\mathbf{z} \in \mathbb{R}^d$, which consists
of feature groups of different parts and modalities. Assume
operator $\pi$ partitions $\mathbf{z}$ into $P$ parts, and $\mu$ is defined
over the sets of $\pi$ to further partition them based on $M$
feature modalities. So, the hierarchy of features inside this
vector is indicated by $\mathbf{z} = [(\mathbf{z}^{\pi_1})^T, \ldots, (\mathbf{z}^{\pi_P})^T]^T$, in which each
$\mathbf{z}^{\pi_i} = [(\mathbf{z}^{\mu_1,\pi_i})^T, \ldots, (\mathbf{z}^{\mu_M,\pi_i})^T]^T$.

Now the problem of multiclass action recognition can be
considered as multiple binary regression based classification
problems in a one-versus-all manner. Given n training samples
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, in which $\mathbf{x}_i \in \mathbb{R}^d$, and their corresponding
labels for C distinct classes, $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_C]$ with $\mathbf{y}_c \in \{0,1\}^n$
and $\forall i : \sum_{c=1}^{C} y_c^i = 1$, we are looking for a projection
matrix $\mathbf{W}^* \in \mathbb{R}^{d \times C}$ which minimizes a set of loss functions
$J_c(\langle \mathbf{x}_i, \mathbf{w}_c^* \rangle, y_c^i)$ for all classes $c \in \{1, \ldots, C\}$ and samples
$i \in \{1, \ldots, n\}$. Our choice for the total loss function, without loss
of generality, is the sum of squared errors ($\forall c : J_c(a, b) = (a - b)^2$).
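
The one-versus-all setup can be illustrated with a short sketch that builds the binary label vectors and evaluates the sum of squared errors. The data layout (samples as rows, one weight vector per class) and the function names are assumptions made for this example.

```python
def one_vs_all_labels(class_ids, num_classes):
    """Return C label vectors y_c, each of length n with entries in {0, 1}."""
    return [[1 if cid == c else 0 for cid in class_ids]
            for c in range(num_classes)]


def squared_error_loss(X, W, Y):
    """Sum over classes c and samples i of (<x_i, w_c> - y_c^i)^2.

    X: n samples, each a list of d features.
    W: C weight vectors, each of length d.
    Y: C label vectors, each of length n.
    """
    total = 0.0
    for w_c, y_c in zip(W, Y):
        for x_i, y in zip(X, y_c):
            score = sum(a * b for a, b in zip(x_i, w_c))
            total += (score - y) ** 2
    return total


# Three samples with class ids 0, 2, 1 (hypothetical toy data).
Y = one_vs_all_labels([0, 2, 1], num_classes=3)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(squared_error_loss(X, W, Y))
```

Each sample carries exactly one positive label across the C vectors, which matches the constraint that the labels of a sample sum to one.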

The most common shrinkage methods to regularize the
learning weights against overfitting penalize $L^p$ norms
of the learning weights for each class:

$$ \mathbf{w}_c^* = \arg\min_{\mathbf{w}_c} \sum_{i=1}^{n} J_c(\langle \mathbf{x}_i, \mathbf{w}_c \rangle, y_c^i) + \lambda \|\mathbf{w}_c\|_p \tag{3} $$

in which $\lambda$ is the regularization factor. Employing the $L^2$ norm
(p = 2) leads to a general weight decay and minimization
of the magnitude of $\mathbf{W}$, and applying the $L^1$ norm (p = 1) yields
simultaneous shrinkage and sparsity among the individual
features. Such methods simply ignore the structural information
between the features, which can be useful for classification;
therefore, it is beneficial to embed these feature relations into
our learning scheme via structured sparsity inducing mixed
norms.

In the context of depth based action recognition, features are
naturally partitioned into parts. The "Actionlet ensemble" method
[2] tried to discover discriminative joint groups using a data
mining process, which led to an interesting improvement
in performance; however, its heuristic selection process
is discrete and separated from the following learning step.
To address these issues, we propose to apply group sparsity
to perform part selection and classification in a regression
based framework, in contrast to the mining based joint group
discovery of [2].

Fig. 1. Three Levels of the Proposed Hierarchical Mixed Norm
for Multimodal Multipart Learning. We combine two levels of
regularization inside modality groups and between them for
each part, followed by a sparsity inducing norm between the
parts to apply part selection.

We know that the discriminative strengths of the features in each
part are highly correlated regarding all the classes at hand. So
we expect the corresponding learning parameters (elements of
each $\mathbf{w}_c$) to be triggered or halted concurrently within each set
of the $\pi$ partitioning (for each action class). To apply a grouping
effect on these features, we consider each set in $\pi$ as a unit and
measure its strength with an $L^2$ norm of the included learning
weights. On the other hand, we seek a sparse set of parts to
be activated for each class at hand, so we apply an $L^1$ norm
between the $L^2$ values of the groups. Such an intuition can be
formulated by an $L^1/L^2$ mixed norm based on $\pi$ for each class:

$$ \mathbf{w}_c^* = \arg\min_{\mathbf{w}_c} \sum_{i=1}^{n} J_c(\langle \mathbf{x}_i, \mathbf{w}_c \rangle, y_c^i) + \lambda \|\mathbf{w}_c\|_{2,1|\pi} \tag{4} $$

Adding this up for all the action classes with the same
regularization factor, we have:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} \sum_{c=1}^{C} \sum_{i=1}^{n} J_c(\langle \mathbf{x}_i, \mathbf{w}_c \rangle, y_c^i) + \lambda \sum_{c=1}^{C} \|\mathbf{w}_c\|_{2,1|\pi} $$

$$ = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda \|\mathrm{vec}(\mathbf{W})\|_{2,1,1|\pi,\tau} \tag{5} $$

in which $\mathrm{vec}(\cdot)$ is the vectorization operator and $\tau$ is the
partitioning operator of the $\mathrm{vec}(\mathbf{W})$ elements based on their
corresponding tasks (or columns here): $\forall (k, c) : \tau(w_c^k) = c$. We
will refer to this multipart learning method as "MP".
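
To make the part selection effect of the MP penalty in (5) more tangible, here is a small sketch that computes the per-part $L^2$ strengths of each class and sums them. The names (`part_strengths`, `mp_penalty`, `active_parts`), the tolerance, and the toy weights are assumptions for illustration only.

```python
import math


def part_strengths(w_c, parts):
    """L^2 norm of the weights of one class inside each part."""
    return [math.sqrt(sum(w_c[k] ** 2 for k in idx)) for idx in parts]


def mp_penalty(W, parts):
    """Sum over classes of the L^1/L^2 norm over parts, as in (5).

    W: C weight vectors (one per class), each of length d.
    parts: list of index lists, one per body part.
    """
    return sum(sum(part_strengths(w_c, parts)) for w_c in W)


def active_parts(w_c, parts, tol=1e-8):
    """Indices of parts whose weights are not (numerically) all zero."""
    return [j for j, s in enumerate(part_strengths(w_c, parts)) if s > tol]


# Three parts with two features each; two classes (hypothetical weights).
parts = [[0, 1], [2, 3], [4, 5]]
W = [[3.0, 4.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 6.0, 8.0]]
print(mp_penalty(W, parts))      # 5 + 10
print(active_parts(W[0], parts))
print(active_parts(W[1], parts))
```

Because the outer sum acts like an $L^1$ norm on the part strengths, minimizing it tends to push whole parts to zero for a given class instead of individual features.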

Minimization of (5) applies the desired grouping effect to
the features of each part and promotes sparsity in the
number of active parts for each class in a smoother and simpler
way than the actionlet method.

3.2 Multimodal Multipart Learning via Hierarchical Mixed 
Norm 

In the above formulation, we apply an $L^2$ regularization norm
over the heterogeneous features of all the modalities for each part,
and ignore the modality structures between them. In other
words, applying a general $L^2$ norm may suppress
the information at some dimensions. These issues are more
severe when training samples are limited (which is the case for
action recognition in depth), in which case it might lead to weak
generalization of the learning.

To overcome these limitations, we utilize $L^\infty$ to regularize
the coefficients inside each modality, so that "diversity" [21]
can be encouraged. It is already known that the behavior of the $L^p$
norm for p > 2 rapidly moves towards $L^\infty$ [32]; since $L^\infty$ is
not easy to optimize directly, we picked $L^4$ as the most efficient
approximation of it. Higher order norms like $L^6$ apply the same
effect but with a slightly more expensive processing cost.
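
The statement that $L^p$ norms approach $L^\infty$ as p grows can be checked numerically. The sketch below prints the $L^p$ norm of an arbitrary example vector for a few values of p next to its largest absolute entry; the vector and the chosen orders are assumptions for illustration.

```python
def lp_norm(values, p):
    """L^p norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


# Arbitrary example vector; its L^infinity norm is max |v_i| = 2.
v = [0.5, -1.0, 2.0, 0.1]
linf = max(abs(x) for x in v)

for p in (1, 2, 4, 6, 16):
    # The gap to the L^infinity value shrinks as p increases.
    print(p, lp_norm(v, p), lp_norm(v, p) - linf)
```

For this vector the $L^4$ value is already much closer to the maximum entry than the $L^2$ value, which is the behavior the choice of $L^4$ relies on.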

By applying the $L^4$ norm to regularize the weights in each
modality group of each part, we now have a three-level
$L^1/L^2/L^4$ mixed norm. The inner $L^4$ norm gives more "diversity" to
regularize the features inside each part-modality subset.
The $L^2$ norm employs a magnitude based regularization over the
$L^4$ values to link different modalities of each part, and the outer
$L^1$ norm applies the soft part selection between the $L^2/L^4$ values of
each action class (Fig. 1).

Replacing the previous structured norm by the proposed
hierarchical mixed norm in (5), we have:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda \sum_{c=1}^{C} \|\mathbf{w}_c\|_{4,2,1|\mu,\pi} $$

$$ = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda \|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau} \tag{6} $$

Here, $\pi$ indicates the partitioning of features based on their
source body part, and $\mu$ represents the further partitioning of each
part's set regarding the modalities of the features. In the rest
of this paper, we use the abbreviation "MMMP" to refer to this
method. It is worthwhile to note that changing the inner norm to
$L^2$ reduces the hierarchical norm to a two-level mixed
norm, i.e. $\|\mathrm{vec}(\mathbf{W})\|_{2,2,1,1|\mu,\pi,\tau} = \|\mathrm{vec}(\mathbf{W})\|_{2,1,1|\pi,\tau}$, which is derived
directly from the definition of the hierarchical norm (2).

When different learning tasks have similar latent features,
"Multitask Learning" [27] techniques can improve the
performance of the entire system by applying information sharing
between the tasks. Here we are learning classifiers for C
different classes which essentially have many latent components in
common, so pushing them to share some features is beneficial
for the classification task. This can be done by applying an $L^2$
grouping on all the weights corresponding to each individual
feature. Each of these $L^2$ values represents the magnitude of
strength for its corresponding feature among all the tasks.
Then applying an $L^1$ norm over the magnitudes yields a shared
variable selection considering all the tasks. Adding the new
multitask term to (6), we have:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda_1 \sum_{k=1}^{d} \|\mathbf{w}^k\|_2 + \lambda_2 \|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau} \tag{7} $$

$$ = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda_1 \|\mathrm{vec}(\mathbf{W})\|_{2,1|\phi} + \lambda_2 \|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau} \tag{8} $$

Here, d is the number of rows in $\mathbf{W}$, which is equal to the
size of the entire feature vector, and $\phi$ defines the partitioning
of the $\mathrm{vec}(\mathbf{W})$ elements based on their corresponding individual
features: $\forall (k, c) : \phi(w_c^k) = k$.

Combining these two regularization terms can be considered
as a trade-off between sparsity and persistence of features
[33] based on their relations across the parts, modalities, and
between the action classes.

In our experiments, we use P = 20 body joints as the
partitioning operator $\pi$. Each column of $\mathbf{W}$ has the same
hierarchical partitioning as the input features, so $\mathbf{W} = [\mathbf{w}_c^j]$, in which
c indexes the classes and j indexes the feature groups
of the P joints. The features for each joint come from M = 3
different modalities: skeletons, LOP, and HON4D; this defines
the $\mu$ operator. Therefore, each $\mathbf{w}_c^j = [(\mathbf{w}_c^{j,1})^T, \ldots, (\mathbf{w}_c^{j,M})^T]^T$, in
which each $\mathbf{w}_c^{j,m}$ holds the weight elements corresponding to class
c, joint j and modality m. This way (8) expands to:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} \|\mathbf{X}^T \mathbf{W} - \mathbf{Y}\|_F^2 + \lambda_1 \sum_{k=1}^{d} \|\mathbf{w}^k\|_2 + \lambda_2 \sum_{c=1}^{C} \sum_{j=1}^{P} \left( \sum_{m=1}^{M} \|\mathbf{w}_c^{j,m}\|_4^2 \right)^{1/2} \tag{9} $$
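
The expanded objective (9) can be evaluated directly from its three terms. The sketch below is a plain Python illustration under assumed data layouts (samples stored as rows, so the list `X` plays the role of $\mathbf{X}^T$); the function name `mmmp_objective`, the `groups` structure, and the toy values are hypothetical and are not the authors' implementation.

```python
def lp_norm(values, p):
    """L^p norm of a sequence of numbers (p >= 1)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def mmmp_objective(X, Y, W, groups, lam1, lam2):
    """Evaluate the objective of (9).

    X: n samples, each a list of d features (rows of X^T).
    Y: n label rows, each a list of C values in {0, 1}.
    W: d rows, each a list of C weights.
    groups: groups[j][m] lists the feature indices of joint j, modality m.
    """
    n, d, C = len(X), len(W), len(W[0])
    # Squared error term ||X^T W - Y||_F^2.
    loss = 0.0
    for i in range(n):
        for c in range(C):
            score = sum(X[i][k] * W[k][c] for k in range(d))
            loss += (score - Y[i][c]) ** 2
    # Multitask term: L^2 over the classes of each feature, summed (L^1).
    multitask = sum(lp_norm(W[k], 2) for k in range(d))
    # Hierarchical term: L^4 inside each modality, L^2 across the
    # modalities of a joint, summed over joints and classes (L^1).
    hierarchical = 0.0
    for c in range(C):
        for joint in groups:
            l4 = [lp_norm([W[k][c] for k in idx], 4) for idx in joint]
            hierarchical += lp_norm(l4, 2)
    return loss + lam1 * multitask + lam2 * hierarchical


# One joint with two single-feature modalities, two classes (toy data).
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1, 0], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]
groups = [[[0], [1]]]
print(mmmp_objective(X, Y, W, groups, lam1=1.0, lam2=1.0))
```

An objective function of this form could be passed to a generic quasi-Newton optimizer once a gradient is supplied; the sketch only evaluates the value and does not perform any optimization.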

3.3 Two Step Learning Approach 

The downside of the current formulation is the large number of
weights to be learned simultaneously, compared to the number
of training samples, which is highly limited in current depth
based action recognition benchmarks. To resolve this, we first
learn the partially optimum weights for the multipart features
of each modality separately and then fine-tune them by the
proposed multimodal multipart learning.

To learn the partially optimum weights for each modality m,
we optimize:

$$ \hat{\mathbf{W}}_m = \arg\min_{\mathbf{W}_m} J(\mathbf{X}_m^T \mathbf{W}_m, \mathbf{Y}) + \lambda_1 \|\mathrm{vec}(\mathbf{W}_m)\|_{2,1|\phi} + \lambda_2 \|\mathrm{vec}(\mathbf{W}_m)\|_{2,1,1|\pi,\tau} \tag{10} $$

After achieving the partially optimum point for each
modality, we merge the $\hat{\mathbf{W}}_m$ values for all M modalities:

$$ \hat{\mathbf{W}} = [\hat{\mathbf{W}}_1^T, \ldots, \hat{\mathbf{W}}_M^T]^T \tag{11} $$

The next step is to fine-tune the weights in the multimodal-multipart
learning fashion, in a neighborhood of the $\hat{\mathbf{W}}$ values. To do so,
we expect the globally optimum weights not to diverge too much
from their partially optimal points:

$$ \mathbf{W}^* = \arg\min_{\mathbf{W}} J(\mathbf{X}^T \mathbf{W}, \mathbf{Y}) + \lambda_1 \|\mathrm{vec}(\mathbf{W})\|_{2,1|\phi} + \lambda_2 \|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau} + \lambda_3 \|\mathbf{W} - \hat{\mathbf{W}}\|_F^2 \tag{12} $$

The last term in (12) limits the deviation of the learning
weights from their partially optimal point, as we expect them
to be just fine-tuned in this step.
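
The bookkeeping of the two-step scheme, i.e. the merge in (11) and the proximity term of (12), can be sketched as follows. The helper names `merge_modalities` and `proximity_penalty` and the toy matrices are assumptions; the per-modality optimization of (10) itself is not reproduced here.

```python
def merge_modalities(W_hat_per_modality):
    """Stack per-modality weight matrices row-wise, as in (11).

    Each matrix is a list of rows (one row per feature of that modality),
    and every row holds C class weights.
    """
    merged = []
    for W_m in W_hat_per_modality:
        merged.extend(list(row) for row in W_m)
    return merged


def proximity_penalty(W, W_hat):
    """Squared Frobenius distance ||W - W_hat||_F^2, the last term of (12)."""
    return sum((a - b) ** 2
               for row, row_hat in zip(W, W_hat)
               for a, b in zip(row, row_hat))


# Two hypothetical modalities with 1 and 2 features, and C = 2 classes.
W_hat_1 = [[1.0, 2.0]]
W_hat_2 = [[3.0, 4.0], [5.0, 6.0]]
W_hat = merge_modalities([W_hat_1, W_hat_2])

# A candidate W that differs from W_hat in a single entry by 1.
W = [[1.0, 2.0], [3.0, 5.0], [5.0, 6.0]]
print(W_hat)
print(proximity_penalty(W, W_hat))
```

Note that this stacking orders rows modality by modality, so any index lists used for the part and modality partitions have to follow the same row ordering.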

Upon optimization over the training data, the prediction of the
learned classifier for each testing sample $\mathbf{x}_i$ can be obtained
by:

$$ f(\mathbf{x}_i) = \arg\max_{c} \langle \mathbf{x}_i, \mathbf{w}_c^* \rangle \tag{13} $$

The optimization steps are all done by the "L-BFGS" algorithm
using the off-the-shelf "minFunc" tool [34].
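
The decision rule (13) amounts to picking the class whose weight vector gives the largest inner product with the sample. A minimal sketch, with the hypothetical function name `predict` and toy weights:

```python
def predict(x, W):
    """Return the index of the class with the highest score <x, w_c>.

    x: feature vector of length d.
    W: C weight vectors (one per class), each of length d.
    """
    scores = [sum(a * b for a, b in zip(x, w_c)) for w_c in W]
    return max(range(len(scores)), key=scores.__getitem__)


# Two classes, two features (hypothetical learned weights).
W = [[1.0, 0.0], [0.0, 1.0]]
print(predict([0.2, 0.9], W))
print(predict([0.7, 0.1], W))
```

In case of a tie, this sketch returns the lowest class index, which is an arbitrary choice of the example rather than something specified in the text.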


TABLE 1

Subject-wise Cross-Validation Performance Comparison of the
Proposed Hierarchical Mixed Norm with Plain and Multipart
Group Sparsity Norm on the MSR-DailyActivity Dataset

Method   Structure/Hierarchical Norm Used                      Accuracy
$L^2$    $\|\mathrm{vec}(\mathbf{W})\|_2^2$                    80.61±2.49%
MP       $\|\mathrm{vec}(\mathbf{W})\|_{2,1,1|\pi,\tau}$       81.55±2.43%
MMMP     $\|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau}$ 84.03±2.16%


4 Experiments 

This section describes our experimental setup details and then 
provides the results of the proposed method on three depth 
based action recognition benchmarks. 

4.1 Experimental Setup 

All the experiments are done on Kinect based
datasets. Kinect captures RGB frames, depth map signals and
3D locations of major joints. To have a fair comparison with
other depth based methods, we ignore the RGB signals.
Skeleton extraction is done automatically by Kinect's SDK based on
the part-based human pose recognition system of [1]. On each
frame, we have an estimate of the 3D positions of 20 joints in the
body. All of our features are defined based on these joints as
the multipart partitioning operator ($\pi$); therefore, each feature
necessarily belongs to one of these parts.

To represent skeleton based features, we first normalize the
3D locations of the joints against the size, position and direction of
the body in the scene. This normalization step eases the task
of comparison between body poses. On the other hand, the
extracted body locations and directions could also be highly
discriminative for some action classes like "walking" or "lying
down"; therefore we add them to the features under a
new auxiliary part. To encode the dynamics of skeleton based
features, we apply the "Fourier temporal pyramid" as suggested
by [2] and keep the first four frequency coefficients of each short
time Fourier transformation. This leads to a feature vector of
size 1,876 for each action sample.
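
The idea of a Fourier temporal pyramid can be sketched for a single one-dimensional trajectory as below. The text only states that the first four frequency coefficients of each short time transform are kept; the three pyramid levels with binary splits and the use of coefficient magnitudes are assumptions of this example, as are the function names.

```python
import cmath


def dft_magnitudes(signal, num_coeffs):
    """Magnitudes of the first num_coeffs DFT coefficients of a signal."""
    n = len(signal)
    out = []
    for f in range(num_coeffs):
        acc = sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                  for t in range(n))
        out.append(abs(acc))
    return out


def fourier_temporal_pyramid(signal, levels=3, num_coeffs=4):
    """Concatenate low-frequency DFT magnitudes over a temporal pyramid.

    Level l splits the sequence into 2**l equal segments (assumed layout).
    """
    features = []
    for level in range(levels):
        segments = 2 ** level
        for s in range(segments):
            start = s * len(signal) // segments
            end = (s + 1) * len(signal) // segments
            features.extend(dft_magnitudes(signal[start:end], num_coeffs))
    return features


# A constant toy trajectory of 8 frames for one joint coordinate.
feats = fourier_temporal_pyramid([1.0] * 8)
print(len(feats))  # (1 + 2 + 4) segments x 4 coefficients
```

Under these assumed settings each coordinate yields 28 values, and 1,876 is a multiple of 28, so the stated feature size is at least consistent with such a configuration; the exact settings used by the authors are not given in this section.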

In addition to skeleton based features, the other modalities we
use are local HON4D [3] and LOP [2], which represent depth based
local dynamics and appearance around each joint. On each
frame, LOPs are extracted on a (96,96,320)-sized depth
neighborhood of each joint, which is divided into 3×3×4
(32,32,80)-sized bins. To represent LOP based kinetics, we use
a similar Fourier temporal pyramid transformation. HON4D
features are also extracted locally over the location of joints on
each frame. We encode HON4D features using LLC
(locality-constrained linear coding) [35] to reduce their dimensionality
while preserving the locality of 4D surface normals. A dictionary
size of 100 is picked for the clustering step. LLC codes go
through a max pooling over a 3-level temporal pyramid.
The dimensions of the LOP and HON4D features are 5,040
and 14,000 respectively. The overall dimensionality of input
features for each sample is 20,916.
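
The reported sizes can be cross-checked with a few lines of arithmetic. The totals and the LOP grid follow directly from the numbers above; the breakdown of the HON4D dimension assumes that the 3-level temporal pyramid uses 1 + 2 + 4 segments and that codes are pooled per joint, which is an assumption of this sketch rather than a stated fact.

```python
# Feature sizes stated in the text.
skeleton_dim, lop_dim, hon4d_dim = 1876, 5040, 14000
total_dim = skeleton_dim + lop_dim + hon4d_dim

# LOP grid: neighborhood size divided by bin size along each axis.
neighborhood = (96, 96, 320)
bin_size = (32, 32, 80)
grid = tuple(n // b for n, b in zip(neighborhood, bin_size))

# HON4D: assumed pyramid layout (1 + 2 + 4 segments), one LLC code
# of dictionary size 100 per segment and per joint.
joints = 20
dictionary_size = 100
pyramid_segments = 1 + 2 + 4
hon4d_estimate = joints * dictionary_size * pyramid_segments

print(total_dim)       # overall dimensionality
print(grid)            # bins along each axis
print(hon4d_estimate)  # compare with the stated HON4D size
```

Under the assumed layout the HON4D estimate matches the stated 14,000, and the three modality sizes add up to the stated 20,916.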

4.2 MSR-DailyActivity3D Dataset 

Owing to its intra-class variations and choices of action
classes, the MSR-DailyActivity dataset [2] is one of the most
challenging benchmarks for action recognition in depth sequences.
It contains RGB, depth, and skeleton information of 320 action
samples, from 16 classes of daily activities in a living room.
Each activity is done by 10 distinct subjects in two different
ways, and evaluations are applied over a fixed cross-subject
setting: the first five subjects are taken for training and the others for
testing. Unlike other datasets, MSR-DailyActivity has a more
realistic variation within each class. Subjects used both hands
randomly to do the activities, and samples of each class are
captured in different poses.

TABLE 2

Performance Comparison of the Proposed Method Using
Plain/Structured/Hierarchical Norms on the Standard
Evaluation Split of the MSR-DailyActivity Dataset

Method   Structure/Hierarchical Norm Used                      Accuracy
$L^1$    $\|\mathrm{vec}(\mathbf{W})\|_1$                      86.88%
$L^2$    $\|\mathrm{vec}(\mathbf{W})\|_2^2$                    87.50%
MP       $\|\mathrm{vec}(\mathbf{W})\|_{2,1,1|\pi,\tau}$       88.13%
MMMP     $\|\mathrm{vec}(\mathbf{W})\|_{4,2,1,1|\mu,\pi,\tau}$ 91.25%

TABLE 3

Performance Comparison on the Standard Evaluation Split of
the MSR-DailyActivity Dataset using Single Modality and
Multimodal Features.

Method                    Modalities            Accuracy
Actionlet Ensemble [2]    LOP                   61%
Proposed MP               LOP                   79.38%
Orderlet Mining [36]      Skeleton              73.8%
Actionlet Ensemble [2]    Skeleton              74%
Proposed MP               Skeleton              79.38%
Local HON4D [3]           HON4D                 80.00%
Proposed MP               HON4D                 81.88%
Actionlet Ensemble [2]    Skeleton+LOP          85.75%
Proposed MMMP             Skeleton+LOP          88.13%
MMTW [16]                 Skeleton+HON4D        88.75%
Proposed MMMP             Skeleton+HON4D        89.38%
DSTIP [14]                DCSF+LOP              88.20%
Proposed MMMP             Skeleton+LOP+HON4D    91.25%

First, to verify the strengths of our proposed hierarchical
mixed norm, we evaluate the performance of the classification
in a subject-wise cross-validation scenario. We evaluate the
performance of the plain $L^2$ norm, the multipart structured norm
(MP), and the proposed hierarchical mixed norm (MMMP),
in all 252 possible train/test splits of 5 out of 10 subjects.
To have a proper comparison between these norms, we have
not applied the multitask term. The results of this experiment
are shown in Table 1. Adding part based grouping, when it
ignores the modality associations between the features,
slightly improves the performance from 80.61% to 81.55%.
By adding multimodality grouping and applying the proposed
hierarchical mixed norm, the improvement is more significant and
reaches 84.03%.
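
The figure of 252 splits is the number of ways to choose 5 training subjects out of 10. A short sketch that enumerates them (subject ids 1 to 10 are assumed for illustration):

```python
from itertools import combinations

subjects = list(range(1, 11))

# Every choice of 5 training subjects; the remaining 5 form the test set.
splits = []
for train in combinations(subjects, 5):
    test = [s for s in subjects if s not in train]
    splits.append((list(train), test))

print(len(splits))
print(splits[0])  # the first enumerated split
```

The first enumerated split, with subjects 1 to 5 for training, corresponds to the fixed cross-subject setting described for this dataset.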

Next, we verify the results of our method by applying the
mentioned norms on the standard train/test split of the subjects.
As provided in Table 2, applying simple feature selection using
a plain $L^1$ norm leads to 86.88% accuracy. By applying a
plain $L^2$ norm on all the features we get 87.50%. Multipart
learning regardless of the heterogeneity of the modalities leads
to 88.13%. Finally, by adding the multipart learning via the
proposed hierarchical mixed norm we reach an
accuracy of 91.25% on this dataset. Applying higher orders for
the inner-most norm (like $L^1/L^2/L^6$) achieved the same level
of accuracy at a slightly higher processing time.

TABLE 4

Average Cross Subject Performance for MSR-Action3D
Dataset on Three Action Subsets of [5]

Method (protocol of [5])                  Accuracy
Action Graph on Bag of 3D Points [5]      74.7%
Histogram of 3D Joints [7]                79.0%
Eigenjoints [6]                           83.3%
Random Occupancy Patterns [8]             86.5%
Depth HOG [38]                            91.6%
Lie Group [39]                            92.5%
JAS+HOG$^2$ [40]                          94.8%
DL-GSGC+TPM [41]                          96.7%
Proposed MMMP                             98.2%

To assess the strength of the proposed multipart learning, we
evaluate our method in a single modality setting using (10). As
shown in Table 3, on skeleton based features we got 79.38%
compared to 74% of the baseline actionlet method. Using LOPs,
our method achieved 79.38%, which is more than 18 percentage
points higher than the actionlet's performance. For local HON4D features,
we achieved 81.88% compared to 80.00% of the baseline local
HON4D method. Next, we use the partially learned weights
of single modality multipart learning and employ them for
the optimization of (12) to learn globally optimum projections.
First we try the combination of skeleton based features with
LOP. Using the proposed learning, we get 88.13% accuracy,
which outperforms the baseline's best result of 85.75%. [16]
used skeleton and HON4D features in a temporal warping
framework and got 88.75%. Our method outperforms it using
the same set of features by achieving 89.38% accuracy.
Finally, using all three modalities, our method achieves a
performance level of 91.25%. Table 3 shows the complete set
of results for this experiment.

Our implementation is in MATLAB and is not fully 
optimized for time efficiency. The average training and testing 
times of MMMP on a 3.2 GHz Core-i5 machine are 170 and 
2 × 10^-4 seconds respectively, with no parallel processing. 

It is worth pointing out that some of the published works on this 
dataset applied other train/test splits, e.g. [37] reported 93.1% 
accuracy on a leave-one-subject-out cross validation. In this 
setup, the proposed MMMP method achieves 97.5%. 
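
The leave-one-subject-out protocol mentioned above can be sketched as follows. This is a minimal illustration in Python, not the MATLAB code used for the reported experiments; the function names and the toy subject list are assumptions made for the example. Each subject is held out once for testing while all remaining subjects are used for training, and the per-fold accuracies are then averaged.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def average_fold_accuracy(fold_accuracies):
    """Average the per-fold accuracies into a single score."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Toy example: five samples performed by three hypothetical subjects.
subjects = [1, 1, 2, 3, 3]
for held_out, train, test in leave_one_subject_out(subjects):
    print(held_out, train, test)
```

Because every subject is used for testing exactly once, this protocol trains on more data per fold than a fixed half/half subject split, which is one reason its accuracies are not directly comparable to those of the standard split.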

4.3 MSR-Action3D Dataset 

MSR-Action3D [5] is another depth based action dataset, which 
provides depth sequences and skeleton information of 567 
samples for 20 action classes. Actions are performed by 10 
different subjects, two or three times each. Evaluations are 
applied over another fixed cross-subject setting: odd-numbered 
subjects are taken for training and even-numbered ones for 
testing. On one hand, the depth sequences in this dataset have 
clean backgrounds, which eases the recognition; on the other 
hand, the number of classes is higher than in the other 
datasets, which could be a challenge for classification. 

The reported results on this dataset are divided into two 
different scenarios. The first is the average cross subject 
performance on three action subsets defined in [5], and the 
second is the overall cross subject accuracy regardless of 
subsets, as done in [2]. Following [39], we call them the 
protocols of [5] and [2]. Tables 4 and 5 show the results. 
Although we still have the highest accuracy among the reported 
results, the achieved margin is not as large as on the other 
datasets. This is because of the simplicity of the actions in 
this dataset. Since there is no interaction with other objects, 
most of the classes are highly distinguishable using 
skeleton-only features; therefore our multimodality could 
not boost the results that much, but the multipart learning 
still shows its advantage over other methods. 

TABLE 5 

Performance Comparison for MSR-Action3D Dataset Over All 
Action Classes 

Method (protocol of [2])                   Accuracy 
Depth HOG [38] (as reported in [16])       85.5% 
Actionlet Ensemble [2]                     88.2% 
HON4D [3]                                  88.9% 
DSTIP [14]                                 89.3% 
Lie Group [39]                             89.5% 
HOPC [42]                                  91.6% 
Max Margin Time Warping [16]               92.7% 
Proposed MMMP                              93.1% 
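
The difference between the two evaluation protocols can be sketched as follows. This is an illustrative Python sketch only: the function names, the toy labels, and the assumption that each action subset is scored as its own classification problem before averaging are ours, and the actual class lists of the three subsets are defined in [5], not here.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted label matches the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def subset_average_accuracy(per_subset_results):
    """Subset-based protocol: score each action subset separately, then average.

    per_subset_results is a list of (true_labels, predicted_labels) pairs,
    one pair per action subset.
    """
    scores = [accuracy(t, p) for t, p in per_subset_results]
    return sum(scores) / len(scores)


# Toy labels for three hypothetical subsets (not the real subsets of [5]).
toy_subsets = [
    ([0, 1, 1, 0], [0, 1, 1, 1]),
    ([2, 3], [2, 3]),
    ([4, 5, 5, 4], [4, 5, 4, 5]),
]
print(subset_average_accuracy(toy_subsets))

# Overall protocol: a single accuracy over all classes at once.
print(accuracy([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 5, 5]))
```

Classifying within a subset involves fewer competing classes than classifying over all classes at once, which is consistent with the subset-based numbers in Table 4 being higher than the overall numbers in Table 5.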

4.4 3D Action Pairs Dataset 

To emphasize the importance of the temporal order of body 
poses on the meaning of the actions, [3] proposed the 3D Action 
Pairs dataset. It covers 6 pairs of similar actions. The only 
difference between the actions of each pair is their temporal 
order, so they have similar skeleton, poses, and object shapes. 
Each action is performed by 10 subjects, 3 times each. The first 
five subjects are taken for testing and the others for training. 
Given the smaller number of action classes and the absence of 
intra-class variations, this is the easiest benchmark among 
depth based action recognition datasets, and other methods 
have already achieved very high accuracies on it. 
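
Why temporal order matters for such action pairs can be illustrated with a small sketch. The two descriptors below are deliberately simplistic stand-ins chosen for the example, not the features used in this paper, and the "raise" and "lower" sequences are hypothetical 2D joint positions. A descriptor that pools over time cannot separate a sequence from its time-reversed counterpart, while one built from frame-to-frame differences can.

```python
def order_free_descriptor(frames):
    """Average each coordinate over time; the result ignores frame order."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def order_aware_descriptor(frames):
    """Concatenate frame-to-frame differences; the result depends on frame order."""
    return [b[d] - a[d] for a, b in zip(frames, frames[1:]) for d in range(len(a))]


# Hypothetical joint positions (x, y) over three frames.
raise_hand = [[0.0, 0.0], [0.0, 0.5], [0.0, 1.0]]
lower_hand = raise_hand[::-1]  # same poses, reversed temporal order

same_pooled = order_free_descriptor(raise_hand) == order_free_descriptor(lower_hand)
same_motion = order_aware_descriptor(raise_hand) == order_aware_descriptor(lower_hand)
print(same_pooled, same_motion)
```

In this toy case the pooled descriptors coincide for the two sequences while the difference-based descriptors have opposite signs, so only the latter can tell the pair apart.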

Here we apply our full multimodal multipart learning 
method using all three available modalities of features. As 
shown in Table 6, the proposed method outperforms all others 
and saturates the benchmark by achieving perfect accuracy 
on this dataset. 

5 Conclusion 

This paper presents a new multimodal multipart learning 
approach for action classification in depth sequences. We show 
that a sparse combination of multimodal part-based features 
can effectively and discriminatively represent all the available 
action classes at hand. Based on the nature of the problem, we 
utilize a heterogeneous set of features from skeleton based 3D 
joint trajectories, depth occupancy patterns, and histograms of 
depth surface normals, and show the proper way of using them 
as a multimodal feature set for each part. 

TABLE 6 

Performance Comparison for 3D Action Pairs Dataset 

Method                                         Accuracy 
Depth HOG [38] (as reported in [16])           66.11% 
Actionlet Ensemble [2] (as reported in [16])   82.22% 
HON4D [3]                                      96.67% 
Max Margin Time Warping [16]                   97.22% 
HOPC [42]                                      98.33% 
Proposed MMMP                                  100.0% 

The proposed method performs group feature selection, 
weight regularization, and classifier learning in a single 
consistent optimization step. It applies the proposed 
hierarchical mixed norm to model the proper structure of 
multimodal multipart input features by applying a diversity 
norm over the coefficients of each part-modality group, linking 
the different modalities of each part by a magnitude based 
norm, and performing a soft part selection by a sparsity 
inducing norm. 
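
The three-level structure described above can be sketched in a few lines of Python. This is a reading of the description rather than the paper's implementation: we assume the sparsity inducing norm across parts is a plain sum (L1), the magnitude based norm across the modalities of a part is L2, and the norm inside each part-modality group is an Lp norm whose order is left as a free parameter (`inner_order`, a name introduced here). The nested-list layout of the weights is also an assumption of the example.

```python
def hierarchical_mixed_norm(weights, inner_order):
    """Three-level mixed norm over weights[part][modality] = list of coefficients.

    Inner level : Lp norm (order = inner_order) over each part-modality group.
    Middle level: L2 norm over the modality groups of one part.
    Outer level : L1 norm (sum) over parts, which favors whole parts being zero.
    """
    total = 0.0
    for part in weights:
        squared_sum = 0.0
        for group in part:
            inner = sum(abs(w) ** inner_order for w in group) ** (1.0 / inner_order)
            squared_sum += inner ** 2
        total += squared_sum ** 0.5
    return total


# Two hypothetical parts with two modalities each; the second part is all zeros,
# so it contributes nothing to the penalty (it is "deselected").
toy_weights = [
    [[3.0, 0.0], [4.0]],
    [[0.0], [0.0, 0.0]],
]
print(hierarchical_mixed_norm(toy_weights, inner_order=2))
```

Because the outer level sums unsquared per-part magnitudes, shrinking an entire part to zero reduces the penalty more readily than spreading small weights across every part, which is the sense in which the norm acts as a soft part selection.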

The provided experimental evaluations on three challenging 
depth based action recognition datasets show that the proposed 
method can successfully incorporate the structure of the input 
features into a concurrent group feature selection and learning, 
and confirm the strengths of the suggested framework compared 
to other methods. 